Setting Up Your Computer

R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R. Each package (or library) adds additional functionality to the base functionality of R. For instance sf allows us to read and write shapefiles, and do lots of core GIS operations, and mapview allows us to create interative maps. The large number of packages available for R, and the ease of installing and using them, has been cited as a major factor driving the widespread adoption of the language in geography.

Before we begin our exercise, we need to install all the needed packages - we only need to use install.packages() once the first time you use a new package.

In this case I have already installed all the needed packages, so we don’t need to run the next block of code.

# DO NOT RUN INSTALL PACKAGES THIS HAS BEEN RUN FOR YOU ALREADY
# install required packages
#install.packages("tidyverse")
#install.packages("sf")
#install.packages("mapview")
#install.packages("dplyr")

EVERY time you start using R you will need to run library() to load the required packages that have already been installed.

#load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.0 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(sf)
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(mapview)
library(dplyr)

Basic Data Structions

The most straight forward way to store a list of numbers is through an assignment using the c() command (c stands for “combine”). The idea is that a list of numbers is stored under a given name, and the name is used to refer to the data. A list is specified with the c() command, and assignment is specified with the “<-” or “=” symbols. Another term used to describe the list of numbers is to call it a “vector”.

The each element within the c() command is separated by commas.

As an example, we can create a new variable, called bubba which will contain the numbers 3, 5, 7, and 9:

bubba = c(3,5,7,9)

When you enter this command you should not see any output except a new command line. The command creates a list of numbers called bubba. To see what numbers is included in bubba type bubba and press the shift+control+enter key:

bubba
## [1] 3 5 7 9

If you wish to work with one of the numbers you can get access to it using the variable and then square brackets indicating which number. Try the example below:

bubba[2]
## [1] 5
bubba[1]
## [1] 3

Notice that the first entry is referred to as the number 1 entry.

You can store strings using both single and double quotes, and you can store real numbers.

bubba = c('HI','1','some text')
bubba
## [1] "HI"        "1"         "some text"

Notice that we reassigned our original numeric vector as a character vector.

Try replacing a single element

bubba[3] = 'more meaningful text'
bubba
## [1] "HI"                   "1"                    "more meaningful text"

Notice that you can’t mix data types in a single vector

bubba_test = c(1, 'hi','2')
bubba_test    # it simply converts 1 to '1' stored as a character
## [1] "1"  "hi" "2"

Reading in a CSV file

Unfortunately, it is rare to have just a few data points that you do not mind typing in at the prompt. It is much more common to have a lot of data points with complicated relationships. Here we will examine how to read a data set from a file using the read.csv function but first discuss the format of a data file.

We assume that the data file is in the format called “comma separated values” (csv). That is, each line contains a row of values which can be numbers or letters, and each value is separated by a comma. We also assume that the very first row contains a list of labels. The idea is that the labels in the top row are used to refer to the different columns of values.

First we read a very short data file. The data file is called cities.csv and has six columns of data and six rows. Each row is a city in the United States.

cities.csv

City State Population Area Lat Lon
New York New York 8467513 300.5 40.66 73.93
Los Angeles California 3849297 469.5 34.01 18.41
Chicago Illinois 2696555 227.7 41.83 87.68
Houston Texas 2288250 640.4 29.78 95.39
Phoenix Arizona 1624569 518 33.57 12.09
Philadelphia Pennsylvania 1576251 134.4 40 75.13

The command to read the data file is read.csv(). We have to give the command at least one arguments, but we will give three different arguments to indicate how the command can be used in different situations.

When we read in our files we are going to use “relative paths” meaning we start in the folder/directory that this .rmd file is stored using "./" which is like saying “here”, then we are going into the data folder from there, and then refer to the simple.csv file stored there.

The following command will read in the data and assign it to a variable called US. We set head=TRUE make sure we keep the first row holding the column names, and sep="," to indicate that this is a comma delimted text file.

US = read.csv(file="./data/cities.csv",head=TRUE,sep=",")

We can now see what’s inside our new US object:

US
##   X.1 X         City        State Population  Area   Lat     Lon
## 1   1 1     New York     New York    8467513 300.5 40.66  -73.93
## 2   2 2  Los Angeles   California    3849297 469.5 34.01 -118.41
## 3   3 3      Chicago     Illinois    2696555 227.7 41.83  -87.68
## 4   4 4      Houston        Texas    2288250 640.4 29.78  -95.39
## 5   5 5      Phoenix      Arizona    1624569 518.0 33.57 -112.09
## 6   6 6 Philadelphia Pennsylvania    1576251 134.4 40.00  -75.13

You can also use head and tail to look at the top and bottom of your file:

head(US, n=2)
##   X.1 X        City      State Population  Area   Lat     Lon
## 1   1 1    New York   New York    8467513 300.5 40.66  -73.93
## 2   2 2 Los Angeles California    3849297 469.5 34.01 -118.41
tail(US,n=3)
##   X.1 X         City        State Population  Area   Lat     Lon
## 4   4 4      Houston        Texas    2288250 640.4 29.78  -95.39
## 5   5 5      Phoenix      Arizona    1624569 518.0 33.57 -112.09
## 6   6 6 Philadelphia Pennsylvania    1576251 134.4 40.00  -75.13

To get more information on the different options available you can use the help ‘?’ command:

?read.csv

The object US contains the three columns of data. Each column is assigned a name based on the header (the first line in the file). You can now access each individual column using a $ to separate the two names:

US$City
## [1] "New York"     "Los Angeles"  "Chicago"      "Houston"      "Phoenix"     
## [6] "Philadelphia"
US$Population
## [1] 8467513 3849297 2696555 2288250 1624569 1576251

Manipulating Data

Adding new columns

R can quickly and easily generate new columns, select values, etc. This will help you edit attribute tables quickly on the fly. For instance if we want to create a new column based on another column values we simply use $ to create a new column name and $ to access an existing column in our data.frame called US.

In this case we are going to create a new column called pop_den in the US dataset using US$pop_den =. We are going to assign it the values of the Population column divided by Area.

Note: * is used for multiplication and \ is used for division.

US$pop_den = US$Population / US$Area  
US
##   X.1 X         City        State Population  Area   Lat     Lon   pop_den
## 1   1 1     New York     New York    8467513 300.5 40.66  -73.93 28178.080
## 2   2 2  Los Angeles   California    3849297 469.5 34.01 -118.41  8198.716
## 3   3 3      Chicago     Illinois    2696555 227.7 41.83  -87.68 11842.578
## 4   4 4      Houston        Texas    2288250 640.4 29.78  -95.39  3573.157
## 5   5 5      Phoenix      Arizona    1624569 518.0 33.57 -112.09  3136.234
## 6   6 6 Philadelphia Pennsylvania    1576251 134.4 40.00  -75.13 11728.058

Subsetting attributes

One we often want to remove unwanted columns from our data. Let’s say we wanted to keep only a few of our columns. To do this we can use the subset() function. Here we are creating a vector of variable names we want to select using c().

US_subset = subset(US, select = c(City,State,pop_den, Lat,Lon ))

Spatial Features

In this section we will be introducing the use of spatial data types. Spatial vector data (points, lines, polygons) is a simple extension of non-spatial ‘dataframes’ we were using above. Essentially we keep the data in the same order, but we add a few things including, most importantly, a column called geometry which in this case will hold the Lat Lon pairs for each US city in our dataset.

We can assign a new geometry to our data by converting our dataframe into a sf (simple feature) datatype with the function st_as_sf(), which needs our original data US and the name of columns holding the coords (coordinates) for our points.

US_sp = st_as_sf(US, coords = c("Lon","Lat") )
US_sp
## Simple feature collection with 6 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -118.41 ymin: 29.78 xmax: -73.93 ymax: 41.83
## CRS:           NA
##   X.1 X         City        State Population  Area   pop_den
## 1   1 1     New York     New York    8467513 300.5 28178.080
## 2   2 2  Los Angeles   California    3849297 469.5  8198.716
## 3   3 3      Chicago     Illinois    2696555 227.7 11842.578
## 4   4 4      Houston        Texas    2288250 640.4  3573.157
## 5   5 5      Phoenix      Arizona    1624569 518.0  3136.234
## 6   6 6 Philadelphia Pennsylvania    1576251 134.4 11728.058
##                geometry
## 1  POINT (-73.93 40.66)
## 2 POINT (-118.41 34.01)
## 3  POINT (-87.68 41.83)
## 4  POINT (-95.39 29.78)
## 5 POINT (-112.09 33.57)
## 6     POINT (-75.13 40)

We can see that the geometry column has changed.

US_sp$geometry
## Geometry set for 6 features 
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -118.41 ymin: 29.78 xmax: -73.93 ymax: 41.83
## CRS:           NA
## First 5 geometries:
## POINT (-73.93 40.66)
## POINT (-118.41 34.01)
## POINT (-87.68 41.83)
## POINT (-95.39 29.78)
## POINT (-112.09 33.57)

Let’s take a look at our city points on a map to see if it worked.

mapview(US_sp)

Hmmm… something is clearly wrong. I think we are in the ocean here.

We told US_sp what the lat and lon of the points were.

So what is wrong?

The issue here is that we haven’t told US_sp what coordinate reference system (CRS) we are using. Is this a spherical projection using lat lon? Or is it a projected coordinate system measured in meters? What is the prime meridian? Where is the origin (0,0)?

Where is my CRS?

Here’s an example of two very different CRS definitions. At the moment we haven’t told it which one to use…

Yes but what coordinate system?

Ok let’s try again, from scratch. This time we will assign the coords as well as the coordinate reference system crs. When programming we typically use something called EPSG codes to define a crs. Each projection or geographic crs is assigned a unique code - a EPSG code.

US_sp = st_as_sf(US, 
                coords = c("Lon","Lat"),
                crs=4326)
US_sp
## Simple feature collection with 6 features and 7 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -118.41 ymin: 29.78 xmax: -73.93 ymax: 41.83
## Geodetic CRS:  WGS 84
##   X.1 X         City        State Population  Area   pop_den
## 1   1 1     New York     New York    8467513 300.5 28178.080
## 2   2 2  Los Angeles   California    3849297 469.5  8198.716
## 3   3 3      Chicago     Illinois    2696555 227.7 11842.578
## 4   4 4      Houston        Texas    2288250 640.4  3573.157
## 5   5 5      Phoenix      Arizona    1624569 518.0  3136.234
## 6   6 6 Philadelphia Pennsylvania    1576251 134.4 11728.058
##                geometry
## 1  POINT (-73.93 40.66)
## 2 POINT (-118.41 34.01)
## 3  POINT (-87.68 41.83)
## 4  POINT (-95.39 29.78)
## 5 POINT (-112.09 33.57)
## 6     POINT (-75.13 40)

To learn more about EPSG codes and crs more generally refer here.

Ok let’s see if that fixed things.

mapview(US_sp)

Perfect! Ok actually its just ok. Let’s see if we can make it more interesting. Let’s use zcol to tell mapview which column it should plot data from:

mapview(US_sp, zcol='Population')

That’s pretty good. But colors aren’t a great way to communicate the size of a city. Instead let’s change the size of the points based on their population using cex instead

mapview(US_sp, cex='Population')

Huh.. Cool. Maybe we can improve things still. Let’s assign size and color based on Population and add “moust-over” labels based on the city name.

mapview(US_sp, cex='Population', zcol='Population', label='City')

Reading/Writing Shapefiles

The sf library can also handle shapefiles (and a number of other formats). To do this, we need to use the read_sf function and point it to the shapefile we are interested in, in this case one of US population.

Let’s read it in and take a look:

us_pop = read_sf('./data/polygon/US_pop.shp')
us_pop 
## Simple feature collection with 52 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -179.1489 ymin: 17.88328 xmax: 179.7785 ymax: 71.36516
## Geodetic CRS:  NAD83
## # A tibble: 52 × 5
##    GEOID NAME        population popltnM                                 geometry
##    <chr> <chr>            <dbl>   <dbl>                       <MULTIPOLYGON [°]>
##  1 35    New Mexico     2097021      NA (((-109.0502 31.48, -109.0498 31.49951,…
##  2 72    Puerto Rico    3255642      NA (((-65.23805 18.32167, -65.23467 18.321…
##  3 06    California    39346023      NA (((-118.6044 33.47855, -118.5988 33.477…
##  4 01    Alabama        4893186      NA (((-88.05338 30.50699, -88.05109 30.508…
##  5 13    Georgia       10516579      NA (((-81.27939 31.30792, -81.27716 31.311…
##  6 05    Arkansas       3011873      NA (((-94.61792 36.49941, -94.61765 36.499…
##  7 41    Oregon         4176346      NA (((-123.6647 46.24431, -123.649 46.2484…
##  8 28    Mississippi    2981835      NA (((-88.50297 30.21523, -88.49176 30.209…
##  9 08    Colorado       5684926      NA (((-109.0603 38.59933, -109.0595 38.719…
## 10 49    Utah           3151239      NA (((-114.053 37.59278, -114.0525 37.6047…
## # … with 42 more rows

Note again the existence of the geometry column, this time holding polygons. Also note is has been assigned a crs of NAD83. Let’s see what it looks like:

mapview(us_pop,
        zcol="population")

We can use these polygon features to calculate spatial properties like area. Here we are creating a new column called area and assigning it the area of each polygon as calculated by st_area. As before we will be using $ to create a new column to store the area of each polygon.

us_pop$area = st_area(us_pop)  
head(us_pop$area)
## Units: [m^2]
## [1] 314982234921   9034739290 409670549693 133970789773 152659318908
## [6] 137795052265

In turn we can use this to calculate population density and assign it to a new column pop_den.

us_pop$pop_den = us_pop$population / us_pop$area
mapview(us_pop,zcol="pop_den")

Weird! I don’t see anything. There are two reasons for this. First, we calculated population density as (population / meter\(^2\)), which creates very very small numbers (and probably some rounding errors), and second, we have an outlier - Zoom into Washington DC (the best city in America). Washington is different because it is both a ‘state’ and a city - so its population density is an outlier - and is messing up the color scheme. Everything else is so much lower, they are all getting assigned the color purple.

Let’s try to fix this. First we will recalculate area but in population / km\(^2\), and let’s plot it again.

us_pop$area = as.numeric(st_area(us_pop) / (1000*1000) )
us_pop$pop_den =  us_pop$population / us_pop$area
mapview(us_pop,zcol="pop_den")

The number range on the legend looks better, but the color scheme isn’t working yet. Let’s look at the number range of pop_den. To get a sense of where the number are.

range(us_pop$pop_den)
## [1]    0.4870624 3970.7326150

Maybe it will be more helpful to look at the histogram. We will use the hist function to create a histogram and breaks is assigned to 30 to create lots of bins (allows us to see more detail in the distribution).

hist(us_pop$pop_den, breaks = 30)

Ok it looks like we have to assign a color scheme that ranges from 0 to somewhere around 500. In this case we are going to use the at parameter to set the location of color breaks. To do this we are going to create a sequence of numbers (using seq) that starts at 0 and ends at 500, and counts by 50.

seq(from = 0, to= 500, by=50)
##  [1]   0  50 100 150 200 250 300 350 400 450 500
mapview(us_pop, 
        zcol="pop_den",
        at = seq(from = 0, to= 500, by=50),
        label="NAME")

Amazing! Looks pretty good. Now let’s assume we want to subset a few columns of interest and write out a new shapefile. We can do this, as before using subset to select the columns of interest, and st_write to write the US_subset back out as a shapefile. Note I am using delete_dsn which allows you to overwrite any existing data with the same name.

US_subset = subset(US, select = c(City,State,pop_den, Lat,Lon ))
st_write(US_subset, "./data/polygon/US_pop_update.shp",delete_dsn=T)
## Deleting source `./data/polygon/US_pop_update.shp' failed
## Writing layer `US_pop_update' to data source 
##   `./data/polygon/US_pop_update.shp' using driver `ESRI Shapefile'
## Writing 6 features with 5 fields without geometries.

Let’s do one last thing. We often want to subset individual rows from a large dataset based on some condition. The package dplyr has lots of great tools for data maninpulation like this. In this case we will use the filter() function. For instance in this case we will subset rows where the Lat column is holding a value greater than 40.

northern = US_subset %>% filter(Lat > 40)
northern
##       City    State  pop_den   Lat    Lon
## 1 New York New York 28178.08 40.66 -73.93
## 2  Chicago Illinois 11842.58 41.83 -87.68

Or we can select a row based on a name, for checking for equality notice the use of == instead of =:

Chicago =  US_subset %>% filter(City == 'Chicago')
Chicago
##      City    State  pop_den   Lat    Lon
## 1 Chicago Illinois 11842.58 41.83 -87.68

Want to learn more about R basics?

Swirl has a variety of lessons on methods in R. These are interactive lessons and cover everything from basic data structures, to data visualization.

if(!require("swirl")) install.packages("swirl")
library(swirl)
install_course("R Programming")
swirl()

Additional titles are available https://swirlstats.com/scn/title.html here.